4D-A181  611  A  PROGRAMMABLE  SVSTOLIC  ARRAV  FOR  FACTORIAL  DATA  1/1 

ANALVSI5  PART  1  MATRIX  COMPUTATIONS)  VALE  UNIV  NEU 
HAVEN  CT  DEPT  OF  COMPUTER  SCIENCE  T  PORTA  JUN  87 
UNCLASSIFIED  VALEU/DCS/RR-542  AFOSR-86-0098  F/G  12/4  NL 


-  /  - 


Abstract:  / 

-la'lhis  paper  *e  present^rmal  systolic  algorithms  for  Factorial  Data  Aoalysia-.matrix  products 
of  several  types  such  as  XX*  where  X  is  a  rectangular  matrix  of  size  k  x  n,RX  where  R  is  upper 
triangular  of  size  k,AB  where  A  and  B  are  square  dense  matrices  of  size  k,Cholesky  factorizations 
and  triangular  matrix  inversions. 

All  these  algorithms  are  built  to  run  efficiently  on  the  same  asynchronous  MIMD  triangular 
systolic  array  with  orthogonal  connections:SAJELDA(Systolic  Array  for  Data  Analysis). 
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1.  Introduction 

The  computations  needed  for  Factorial  Data  Analysis  come  down  to  a  unique  computation 
kernel  [2,3].  For  each  of  the  factorial  techniques  there  is  a  preliminary  computation  to  do  on  the 
matrix  of  observations  to  obtain  the  Jfc  x  n  matrix  of  data  X  and  possibly  on  the  A,  (respectively 
B) symmetric  positive  definite  matrix  of  size  n, (respectively  k), which  represents  the  chosen  norm 
on  /^(respectively  Rk).  Then  we  compute  the  Cholesky  factorization  of  A  and  B  when  they  are 
not  diagonakwe  define  i?x  and  Rb  upper  triangular  of  sizes  n  and  k  such  that  A  =  R*Ra  and 
B  =  RqRb  (where  T  stands  for  transposition). 

After  that  we  let  E  =  RbXR\  and  W  =  EET  and  we  define  the  eigenvalue  decomposition 
of  W:(A,-,tu*)  for  «  =  l,...,Jb  such  that  u/fttf,-  =  1.  Finally  we  compute  /<  =  RgWi  and  y,-  = 
A  71/2AXTfi. 

We  can  therefore  restrict  ourselves  to  the  following  set  of  computations: 

1.  matrix  product  A  =  XXT  where  X  is  Jfc  x  n. 

2.  Cholesky  factorization  A  =  JZj  R&  of  order  k. 

3.  triangular  matrix  inversion  iZJ1. 

4.  matrix  products  RX  where  R  is  upper  triangular  of  size  k  and  X  rectangular  of  size  kxn  and 

matrix  products  AB  where  A  and  B  are  both  square  of  order  k. 

5.  dominant  eigenvalues  computation  of  the  symmetric  positive  definite  matrix  W  of  order  k. 

The  purpose  of  that  work  was  to  devise  efficient  systolic  algorithms  for  the  computations  listed 
above  from  1  to  5  all  running  on  the  same  programmable  systolic  array:the  Systolimag  machine 
built  by  Gerard  Chevalier[ll].  This  machine  composed  of  9  processors  allows  to  create  or  to  simu¬ 
late  experimental, two-dimensional  systolic  arrays  of  sizes  from  2  up  to  12:SAJ£DA(Systolic  Array 
for  Factorial  Data  Analysis). 

Each  processor  has  a  microprocessor  MC6809,an  arithmetic  coprocessor  AM  D951,  a  i?OM(Read 
Only  Memory)  of  4Kbytes,a  RAM  (Random  Access  Memory)  of  up  to  16  Kbytes  and  4  communica¬ 
tion  channels  to  let  the  processor  communicate  with  its  4  nearest  neighbors  from  North  .South, East, 
and  West[ll]. 

We  wanted  to  use  the  Systolimag  machine  as  an  asynchronous  MIMD  (Multiple  Instruction  Multiple 
Data)  machine.That  is  to  say  to  use  the  fact  that  the  processors  are  programmable  and  they  don’t 
have  to  do  all  the  same  thing  and  that  their  programs  are  ruled  by  data  flow. 

As  our  systolic  array  had  to  be  of  a  fixed  size  a  (a  <  12)  we  realized  the  computations  1  to  5 
with  matrices  of  size  a  x  a  or  s  x  n  by  block  partitioning  (see  [10,3]).  We  wanted  also  to  keep  the 
same  structure  for  all  our  algorithms  in  order  not  to  lose  time  by  changing  the  connections  with 
the  nearest  neighbors  between  two  algorithms. 

We  chose  a  triangular  array  with  orthogonal  connections  (four  nearest  neighbors  north-east-south¬ 
west)  as  our  stucture  to  deal  with  Factorial  Data  Analysis.The  first  reason  is  because  Data  Analysis 
involve  mainly  symmetric  or  triangular  matrices  so  we  only  need  to  use  a(a  +  l)/2  processors  (one 
per  matrix  element)  instead  of  s2.The  second  reason  is  that  this  structure  is  especially  well  suited 
for  implementing  1  and  2  in  cascade  [10,12, 13].The  orthogonal  connections  are  the  simpliest  to  build 
because  every  processor  has  4  "natural”  (hardware)  communication  channels  with  its  neighbors  and 
they  are  mainly  used  for  the  computations  1  to  5. 

In  section  2  we  present  the  systolic  implementation  of  the  matrix  product  A  =  XXT  with  X 
rectangular  of  size  axn  followed  in  cascade  by  the  Cholesky  factorization  of  size  a  of  A:  A  =  RT  R.ln 
section  3  we  detaile  the  upper  triangular  matrix  inversion  with  R  stored  in  the  array  and  the  matrix 
products  described  as  computations  4. 
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Figure  1:  progression  of  the  matrix  X  in  the  array  for  the 
computation  of  A  =  XX? . 

2.  Systolic  implementation  of  A  =  XX?  and  Cholesky  factorisation  of  A, A  =  R?Ri 

2.1.  matrix  product  A  =  XXT  with  X  of  sise  ixn: 

Set  down  A  —  have: 

The  processor  ij  computes  Xfj  =  Here  are  now  the  detail  of  the  operations  carried  out  by 
the  array  in  the  case  a  =  3  and  n  =  4. The  figure  1  presents  the  progression  of  the  matrix  X  in  the 
array  with  the  time  steps  which  correspond  to  .At  each  clock  cycle  T,each  processor  receiving  input 
data  performs  a  multiplication  on  them,  adds  this  result  to  its  memory  contents  and  broadcasts 
the  data  on  its  outputa.There  are  two  types  of  cells, the  diagonal  cells  which  only  receive  one  input 
data  and  the  others  which  receive  two.lt  corresponds  to  two  data  treatments  which  are  described 
respectively  by  figures  2a  and  2b. 
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We  get  Ai,i  =  J2i=i  x\ ji  »t  the  instant  nT. 

We  get  Aj^  =  n  x\j  at  the  instant  (n  +  2)7. 


We  get  A,,,  =  J2imi,n  xmJ  at  instant  (n  +  2(«  -  1  ))T. 
So  we  need  (n  +  2s  -  2)  time  steps  to  compute  XXT . 
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Figure  2:  (2a)  program  of  a  diagonal  cell. 
(2b)  program  of  a  non  diagonal  cell. 


2.2.  Cholesky  factorisation  of  A, A  =  RtRi 

We  can  link  the  matrix  product  A  =  XXT  together  with  the  Cholesky  factorization  of  A.As 
soon  as  the  first  A are  computed  the  computation  of  R  upper  triangular  such  that  A  =  RT  R  can 
begin. 


Set  down  R  =  (ftj)i<«j<*-We  have: 


ri,i  =  [Xi,i  -  r,M1/2  for  •'  =  1, 


V 


rij  =  [A.J  -  for  j  =  and  «'  <  j. 


There  are  Cholesky  algorithms  different  from  this  one  presented  here.Those  are  adaptations 
of  the  LU  factorization  of  a  square  matrix  (chapter  4  of  [7], chapter  2  of  [4], [6])  to  the  symmetric 
case  [lj.The  Brent-Luk  model  [1]  is  an  hexagonal  array  of  «(«  + 1)/2  processors  and  needs  4s  times 
steps.The  algorithm  of  Schreiber  (12], [13]  can  be  implemented  on  a  triangular  orthogonal  array 
composed  of  s(s  +  l)/2  processors  and  needs  only  3s  time  steps.The  arithmetic  operations  are 
roughly  the  same.The  winning  of  s  time  steps  comes  from  the  fact  that  in  our  case  the  matrix 


R  stays  in  the  array  for  a  susequent  utilisation  (computation  of  R~lpr  matrix  product  of  R  by 
another  matrix).  In  the  Brent  and  Luk  model  (1]  the  computation  of  R  is  achieved  only  when  the 
matrix  R  is  out  of  the  array  which  takee  s  additional  times  steps. 

In  our  array  the  non  diagonal  processors  ji  compute  rtj  for  j  >  t,and  the  diagonal  processors 
compute  r<4.We  can  see  mi  the  figure  S  how  we  obtain  the  uj  from  the  Xtj  =  A ^  at  each  time 
step  .As  for  the  matrix  product  XXT  there  are  taro  types  of  cells, the  diagonal  cells  and  the  others 
which  perform  different  tasks  showed  in  the  figures  4a, 4b, 5a  and  5b. These  tasks  can  be  decomposed 
in  two  steperthe  first  one  is  composed  of  (j?  -  1)  time  steps  where  j  stands  for  the  column  index  of 
the  array  and, the  second  one, erf  one  time  step.  During  the  first  step, every  processor  receiving  input 
data  performs  a  multiplication  on  them^ubotr  acts  this  result  to  its  memory  content  and  broadcasts 
the  data  an  its  outputa.During  the  second  step, the  diagonal  proceeaors  compute  the  square  root  of 
their  memory  content  and  send  it  downwards  and, the  non  diagonal  processors, divide  their  memory 
content  by  the  last  data  they  receive  and  sand  this  result  to  the  right. 


Figure  3:  Choleaky  factorization  of  A:computation  of  the 
upper  triangular  matrix  R  such  that  A  =  RTR. 


We  get  ri,i  =  [A,,!]1/’  at  the  instant  (n  +  1)T. 

We  get  rj,i  =  (Ai(J  -  rj  j]1/1  at  the  instant  (n  +  •  +  1)T  =  (n  +  4)T. 
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We  get  rt<t  =  (A,.,  -  £|=.i,«-i  rit, J1^1  the  instant  (n  +  2s  -  1  )T  =  (n  +  l  +  S(s  -  1  ))T. 
So  we  need  (3«  —  2)  time  stepe  to  compute  R  from  A. 


Figure  4:  (4a)  first  step  of  the  program  of  a  diagonal  cell. 
(4b)  first  step  of  the  program  of  a  non  diagonal  cell. 
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Figure  5:  (Sa)  second  step  of  the  program  of  a  diagonal  cell. 
(5b)  second  step  of  the  program  of  a  non  diagonal  cell. 
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3.  Other  matrix  computations: 

We  defined  the  triangular  structure  of  the  array  from  the  two  functions  to  perform  in  cas- 
cadercomputation  of  A  =  XXT  and  then  the  Cholesky  factorization  of  A.It  remains  to  show  that 
the  triangular  array  allows  us  to  execute  efficiently  the  other  functions:matrix  inversion  and  matrix 
products. 


3.1.  triangular  matrix  inversion: 


Sometimes, in  Factorial  Data  Analysis, we  have  A-1  and  B~l  which  are  matrices  of  empirical 
variance  of  the  form  XXT instead  of  A  and  B,  and  for  the  computation  of  E  =  RbXR^jng  need 
the  Cholesky  factorizations  of  A  and  B.To  get  these  we  proceed  in  two  steps:first,we  define  the 
Cholesky  factorizations  of  A~l  and  =  R^lRA-i  and  B~l  =  R^-i  Rb~1  and  second, we 

inverse  the  upper  triangular  matrices  RA- 1  and  rB~  »  .Then  we  have:  E  =  1X(Ra-i)  T. 

The  step  of  matrix  inversion  is  always  preceded  by  a  Cholesky  factorization.  That  is  why  we 
can  always  compute  R~x  from  R, R  being  already  stored  in  the  array. We  present  here  a  method  for 
our  triangular  array  of  s(s  +  l)/2  processors  which  can  be  executed  in  (2s  -  1)  time  steps  from  R 
stored  in  the  array. 

The  Li  and  Wah  algorithm  [8]  can  be  done  on  a  triangular  array  of  s(s  +  l)/2  processors 
and  in  (2s  —  1)  time  steps  but  the  interconnections  are  this  time  hexagonal  (every  processor  has  6 
neighbors)  .Furthermore  the  matrix  R  and  the  matrix  W  of  the  intermediate  results  which  will  be 
R~1  at  the  exit  of  the  array  ,are  both  introduced  in  the  array. 

For  our  method  and  that  of  Li  and  Wah  [8]  the  arithmetic  operations  are  roughly  the  same, only 
the  diagonal  processors  tasks  and  the  data  flow  movements  differ  .These  algorithms  are  optimal  in 
that  sense  that  they  minimize  the  number  of  processors  and  time  steps.(see  [8]) 


Set  down  R  =  (rij)i<i<j<t  and  R~*  =  (rJJ)i<,<,<«.We  have: 


r'i,i 


=  r^  for  i  =  l,... 


*  and  r\j  =  1  r,,,r{;]  •  r',.  for  j  =  1, . . . ,  s  and  t  <  j. 


The  figure  6  presents  the  movements  of  the  rj  •  and  the  time  steps  they  are  obtained  from 
the  r,j  in  the  array  for  a  =  4.The  diagonal  processors  **  having  their  memory  content  equal  to 
r,-,-  .inverse  it  and  send  this  value  downwards  and  to  the  left  at  the  right  time  steps,  (see  figure 
7a)  .The  non  diagonal  processors  ji  compute  the  r\j  from  the  contained  in  their  memory. They 
execute  tasks  which  can  be  splitted  up  into  three  steps, the  first  and  the  third, of  one  time  step  and 
the  second,  of  (j  —  i  —  1)  time  steps  corresponding  to  (J  -  i  -  1)  successive  couple  of  data  inputs 
where  t  and  j  stand  respectively  for  the  row  and  column  indices  of  the  array.These  three  steps  are 
described  by  the  figures  7b, 8a  and  8b.  During  the  first  step  every  processor  receiving  an  input 
data  sends  first  its  memory  content  downwards, then  multiplies  the  data  with  it, changes  the  sign 
and  sends  the  data  on  its  output.During  the  second  step  the  processors  perform  a  multiplication 
on  their  input  data^substract  this  result  to  their  memory  content  and  send  the  data  on  their 
outputs.Finally, during  the  third  step, the  processors  multiply  the  last  input  data  they  receive  with 
their  memory  content, send  the  data  on  their  output  and  then  this  result  to  the  left. 
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Figure  7:  (7a)  program  of  a  diagonal  cell. 

(7b)  first  step  of  the  program  of  a  non  diagonal  cell. 


Figure  8:  (8a)  second  step  of  the  program  of  a  non  diagonal 
cell.  (8b)  third  step  of  the  program  of  a  non  diagonal  cell. 


3.2.  matrix  products: 

There  are  a  lot  of  systolic  algorithms  for  matrix  products.They  depend  on  the  chosen  structure 
of  the  array, (hexagonal  or  orthogonal  connections),  and  on  the  type  of  cells, (with  local  memory  or 
not).  Suppose  we  want  to  compute  the  product  C  =  AB  with  A  and  B  square  matrices  of  size 
s;the  rows  of  A  and  the  columns  of  B  will  have  to  cross  in  the  array  to  get  the  matrix  C. 

When  the  cells  have  no  local  memory  the  intermediate  results  52i=i  p  <  p  <  s  have  to 

go  through  the  array  at  the  same  time  as  the  matrices  A  and  B  ([9], [6], [8]). 

In  the  model  of  Melkemi  and  Tchuente  [9]  the  matrix  A  (respectively  B )  is  introduced  column 
by  column  (respectively  row  by  row)  and  moves  horizontally  along  the  rows  of  the  array  from  the 
left  to  the  right  (respectively  from  the  right  to  the  left).  C  moves  vertically  from  the  top  to  the 
bottom.The  connections  between  cells  are  orthogonal  and  the  array  can  be  square  (composed  of  s2 
cells)  or  rectangular  (composed  ofjxm  cells  with  m  >  a)  .The  product  is  executed  after  (3a  —  2) 
time  steps.If  the  array  is  of  size  s(2s  -  1)  we  only  have  to  introduce  the  matrix  A  (respectively  B) 
column  by  column  (respectively  row  by  row)  .But  if  the  array  is  of  size  ax  m  with  a  <  m  <  2a-  l,we 
have  to  repeat  some  of  the  elements  of  the  columns  of  A  (respectively  of  the  rows  of  B)  because 
some  of  the  cells  have  to  perform  the  tasks  of  several  cells.  By  using  this  principle  we  can  execute 
a  matrix  product  of  two  square  matrices  of  size  a  on  an  array  of  a2  cells  in  (3a  —  2)  time  steps. 
This  algorithm  is  optimal  (it  minimizes  the  number  of  time  steps  and  processors). Nevertheless  it 
has  two  drawbacks: 

1.  it  is  necessary  to  apply  a  non  trivial  treatment  to  the  matrix  A  and  B  (dupplicate  some  columns 
or  rows)  before  introducing  them  into  the  array. 

2.  when  we  want  to  compute  several  matrix  products  in  cascade, as  it  is  our  case  when  we  have 
to  deal  with  block  partioning,the  duration  of  introducing  the  matrices  A  and  B  into  the  array 
is  practically  doubled  which  delays  for  about  a  time  steps  for  the  next  product. 

The  Kung  and  Leiserson  model  [6], (7]  is  used  for  band  matrix  products.  If  A  and  B  are  of  band 
widths  wi  and  u/2  the  product  C  —  AB  is  performed  on  an  array  with  hexagonal  connections, a 
parallelogram  shape  and  composed  of  wi  X  u/2  cells. A  and#  are  introduced  diagonal  by  diagonal 
along  the  diagonals  of  the  array  from  the  top  and  the  result  C  moves  vertically  from  the  bo t  com 
to  the  top.The  product  is  obtained  after  3»-|-mm(wi,tBj)  time  steps.However  this  array  cannot  be 
used  for  dense  matrices  because  the  number  of  cells  of  the  array  becomes  too  high  ((2s  -  l)2  cells 
would  be  necessary). 

The  Li  and  Wah  algorithm  [8]  deals  with  dense  matrices  on  an  array  with  hexagonal  connec- 
tions.lt  is  of  the  same  type  of  array  than  that  of  Kung  and  Leiserson  but  it  allows, thanks  to  a 
different  data  introduction, to  reduce  the  number  of  time  steps  and  cells.The  array  has  this  time  an 
hexagonal  shape  and  is  composed  of  3 s(s  -  1)  cells.The  matrices  B  and  C  move  along  the  diago¬ 
nals  of  the  array, B  from  the  top,C  from  the  bottom,  and  A  moves  vertically  from  the  bottom  to 
the  top.The  matrices  A  and  B  are  introduced  counterdiagonals  by  counterdiagonals, the  (2s  -  1) 
counterdiagonals  along  the  diagonals  of  the  array  and  we  get  C  diagonal  by  diagonal.The  matrix 
product  is  computed  after  2s  time  steps.The  algorithm  is  simple  to  do  and  very  fast  but  the  cost 
of  processors  is  still  high. 

When  the  cells  have  a  local  memory  the  arrays  are  the  most  often  square,  with  orthogonal 
connections  and  composed  of  s2  cells.The  matrices  A  and  B  move  along  perpendicular  directions 
and  the  intermediate  results  XIi=i,p  <  p  <  a  are  stored  in  the  cell  of  the  ith  row  and  jth 

column  which  computes  Cij  ([5]). In  that  case  the  matrix  product  takes  (3s  -  2)  time  steps  and 
(s  -  1)  additional  time  steps  are  necessary  to  get  the  matrix  C  out  of  the  array. 


9 


Melkemi  and  Tchuente  [9j  use  again  the  same  technique  of  dupplication  of  some  elements  of 
A  and  B  in  order  to  activate  faster  the  cells  of  the  array  and  thus  save  (s  -  1)  time  steps.This 
algorithm  is  optimal  if  we  introduce  A  and  B  along  two  sides  of  the  array  but  it  has  the  same 
drawbacks  than  those  quoted  earlier. 

3.2.1.  matrix  product  RX  with: 

•  R  upper  triangular  of  size  a 

•  X  rectangular  of  size  a  x  n 

We  cannot  use  here  the  techniques  of  matrix  products  described  above  because  our  array  is  tri¬ 
angular  instead  of  being  square  or  rectangular, but  the  basic  operations  are  roughly  the  same. Instead 
of  moving  R  and  X  simultaneously  in  the  array  along  different  directions  we  introduce  the  matrix 
R  a  little  time  before  the  matrix  X  in  order  to  store  it  in  the  array.Every  is  then  memorized  in 
the  cell  yi.  When  X  is  introduced, its  elements  are  multiplied  by  the  memory  contents  of  the  cells 
forming  the  product  C  =  RX  which  moves  from  the  top  to  the  bottom. 

Set  down  R  =  (r,j)i<, =  (x,j)  i<,<,  and  C  =  (cjj)  k,<,  . 

l<j'<n  1<j<» 

The  figure  9  presents  the  movings  of  the  matrices  R  and  X  and  the  dates,  the  c,  j  are  computed 
in  the  array  for  a  =  3  and  n  =  4. There  are  once  more  two  types  of  cells  which  tasks  are  described 
by  the  figures  10a  and  10b.  Every  diagonal  cell  tt  memorizes  rj,,then  multiplies  its  memory  content 
by  the  input  data  it  receives  and  sends  this  result  downwards.  Every  non  diagonal  cell  ij  memorizes 
fj%i  .multiplies  its  memory  content  by  the  input  data  coming  from  the  left  and  broadcasts  it  to  the 
right, adds  this  product  with  the  input  data  coming  from  the  top  and  sends  this  result  downwards. 
Our  matrix  product  is  executed  in  (2s  +  n-  1)  time  steps  which  is  of  the  same  order  of  time  as 
the  product  algorithms  described  above  but  with  practically  twice  less  cells.This  product  takes 
place  in  the  computation  of  E  =  RbXR*  where  R*  and  Rb  are  upper  triangular  of  sizes  n  and 
k  respectively  and  X  is  k  x  n  (see  introduction). When  k  >  a  we  apply  block  partitioning  to  the 
matrices  R  and  X  (see  [10]). 

The  product  C  =  RX  is  followed  by  the  product  E  =  C(R!)T  which  can  be  performed  in  the 
same  way  but, instead  of  introducing  C  row  by  row  as  X,  we  introduce  C  column  by  column.This 
second  step  begins  at  the  instant  (n  +  1)T  just  after  the  entry  of  xi„  in  the  array  and  is  executed  in 
(2n  +  a  —  1)  time  steps. As  soon  as  we  get  the  first  elements  et  J  of  E,  the  computation  of  W  =  EET 
(see  introduction)  can  begin.This  last  step  begins  at  the  instant  (2n+2)T  and  as  it  can  be  performed 
in  (n  +  2s  —  2)  time  steps  (see  2.1.)  we  get  W  in  the  array  after  (3n  +  2s)  time  steps  and, out  the 
array,  after  (4n  +  2s  -  1)  time  steps. 
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Figure  9:  moving  of  the  matrices  and  C  in  the  array 
for  the  product  C  =  RX. 

We  get  C14  =  £J=1  #  •  xjj  at  the  instant  (*  +  1  )T. 

We  get  C14  =  2(b1^  r\ji  •  xj,j  at  the  instant  («  +  2 )T. 

We  get  ci*  =  2j=i,j  rU  ’  *i ,»  at  the  instant  (*  +  n)T. 

We  get  c,,n  =  r,,,  •  x,,»  at  the  instant  (2s  +  n  -  1)T. 

So  we  get  C  =  fUf  after  (2«  +  n  -  1)  time  steps. 
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Figure  10:  (10a)  program  of  a  diagonal  cell.The  cell  mem¬ 
orizes  the  value  v  it  receives. 

(10b)  program  of  a  non  diagonal  cell.The  cell  memorizes  the 
first  value  v  it  receives  and  then  broadcasts  the  following  val¬ 
ues  to  the  right. 

3.2.2.  matrix  product  C  =  AB  with: 

•  A  and  B  square  matrices  of  size  a. 

Our  array  being  composed  of  processors  with  local  memories, we  can  use  the  techniques  pre¬ 
sented  in  [5]  which  consist  of  sending  A  horizontally  by  rows,  B  vertically  by  columns  and  of  storing 
the  intermediate  results  of  the  product  £j=i  $  in  the  processor  ij.(aee  figure  11)  But  in  that 
way  we  only  get  the  elements  of  the  lower  triangular  part  of  C.  In  order  to  obtain  the  upper 
triangular  part  of  C  we  introduce  the  matrix  B  (respectively  A)  by  columns  (respectively  rows) 
again  into  the  array  but  that  time  horizontally  (respectively  vertically). The  matrices  A  and  B  are 
introduced  again  into  the  array  in  the  following  way: 

Every  ,/  =  1, . . .  ,a  of  the  jth  row  of  A  is  memorized  in  the  jth  diagonal  processor  during 
the  computation  °f  ciJ  —  ajjbtj  and  is  sent  again  to  the  bottom  of  the  array,  one  time  step 

after  b,j; the  bijj  =  1  ,...,#  of  the  ith  column  of  B  enter  the  array  one  time  step  after  a,,, .(see 
figure  12) 

Every  diagonal  processor  it  computes  Cjit  and  every  non  diagonal  processor  ij&j  and  then  cJit 
with  t  >  j; without  waiting  delay  between  these  two  operations.The  movings  of  the  matrices  A  and 
B  in  the  array  and  the  instants  when  the  Cij  are  computed  are  detailed  in  the  figures  11  and  12  and 
the  tasks  of  the  diagonal  and  non  diagonal  cells, in  the  figures  13a, 13b, 14a  and  14b.This  product  is 
executed  in  (4«  -  3)  time  steps  which  represents  a  saving  in  comparison  with  the  standard  matrix 
product  which  is  executed  in  (3s  -  2)  time  steps  on  an  array  of  *2  processors. 
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Figure  11:  Moving  of  the  matrices  A  and  £  for  the  matrix 
product  C  —  AB. 

First  phase:eomputation  of  the  lower  triangular  part  of 
C. 
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Figure  13:  (13a)  first  phase  of  the  program  of  a  diago¬ 
nal  cell:the  cell  memorizes  every  ui  it  receives  in  an  internal 
FIFO(First  In  First  Out). 

(13b)  second  phase  of  the  program  of  a  diagonal  cell:the  cell 
broadcasts  every  ui  memorized  in  its  internal  FIFO. 


Figure  14:  (14a)  first  phase  of  the  program  of  a  non  diagonal 
cell. 

(14b)  second  phase  of  the  program  of  a  non  diagonal  cell. 


The  reader  will  find  page  16  a  summary  table  of  the  studied  algorithms  as  well  as  comparisons 
with  other  algorithms.  We  put  a  star  *  on  those  we  chose. All  these  algorithms  are  linear  in  time. 


algorithm 

shape  and  connections 
of  the  array 

number  of  time  steps 

authors  and  special 
features 

matrix  product  XXT 
with  X  of  size  a  x  n 

triangular  array  of 
e(s  -(-  l)/2  proces¬ 
sors  with  orthogonal 
connections  [fig  1-2] 

(r*  +  2a  -  2)  time  steps 

Schreiber*  [12], [13] 

Cholesky  factorization 
of  XXT  .square  matrix 
of  size  a 

triangular  array  of 
a(a  +  l)/2  proces¬ 
sors  with  orthogonal 
connections  [fig  3-5] 

(38  —  2)  time  steps 

Schreiber*  [12],[13]:in 
cascade  with  the  ma¬ 
trix  product  XXT 

triangular  array  of 
a(a  -r  l)/2  proces¬ 
sors  with  hexagonal 
connections 

4a  time  steps 

Brent- Luk  [l] 

inversion  of  R  upper 
triangular  matrix  of 
size  a 

triangular  array  of 
a(a  +  l)/2  proces¬ 
sors  with  hexagonal 
connections 

(2s  -  1)  time  steps 

Li-Wah  [8] 

triangular  array  of 
a(a  +  l)/2  proces¬ 
sors  with  orthogonal 
connections  [fig  6-8] 

(2s  —  1)  time  steps 

Porta*  [10]  :is  computed 
from  R  stored  in  the 
array 

matrix  product  RX 
with  R  upper  trian¬ 
gular  of  size  8  and  X 
rectangular  of  size  s  x  n 

triangular  array  of 
a(a  +  l)/2  proces¬ 
sors  with  orthogonal 
connections  [fig  9-10] 

(2s  +  n  -  1)  time  steps 

Porta*  [10]  :R  is  stored 
in  the  array  before  X 
is  introduced 

matrix  product  of  two 
square  dense  matrices 
of  size  a 

square  array  of  a2  pro¬ 
cessors  with  orthogonal 
connections 

(3s  —  2)  time  steps 

Melkemi-Tchuente  [9] 

hexagonal  array  of 

3s(s  -  1)  processors 
with  hexagonal  connec¬ 
tions 

28  time  steps 

Li-Wah  [8] 

triangular  array  of 
s(a  +  l)/2  proces¬ 
sors  with  orthogonal 
connections  [fig  11-14] 

(4s  -  3)  time  steps 

Porta*  [10]:compu- 
tation  of  the  upper 
triangular  part  and 
then  of  the  lower  trian¬ 
gular  part. 

Table  1:  Matrix  computations 
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